STOAT is a versatile GWAS (Genome-Wide Association Study) tool designed to work with pangenome graphs.
It supports binary phenotypes both with and without covariates: using Fisher’s exact test or Chi-squared test when no covariates are present, and logistic regression when covariates are included.
For quantitative traits, STOAT performs linear regression, again with or without covariates. Additionally, it supports eQTL analysis, enabling association testing between genetic variants and gene expression levels.
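As a rough illustration of the binary case without covariates, the sketch below runs both tests on a 2x2 table of counts (two cohorts by two paths through one snarl). It is not STOAT's implementation: the function names and the example counts are made up for this illustration, the code only handles 2x2 tables, and the chi-squared version assumes every row and column total is non-zero.

```python
import math


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at most as likely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared p-value (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the survival function is erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows are cohorts, columns are the two paths of a snarl.
cases_path_a, cases_path_b = 20, 80
controls_path_a, controls_path_b = 50, 50
print(fisher_exact_2x2(cases_path_a, cases_path_b, controls_path_a, controls_path_b))
print(chi2_2x2(cases_path_a, cases_path_b, controls_path_a, controls_path_b))
```

A snarl with more than two paths gives a 2xk table, which this simplified sketch does not cover.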
It contains two modes:
Stoat:
position, snarl_id,
type).
Stoat VCF:
Linear Regression: Corrected near-perfect collinearity case by merging identical columns.
Logistic Regression: Corrected output format.
Other Fixes:
Stoat Graph:
We constructed three simulated datasets: binary, quantitative, and eQTL.
The binary and quantitative simulations each include their own pangenome graph representing a single chromosome, incorporating various types of variation, such as SNPs, INDELs, and complex variants. Despite the different variations, the graph structure—based on a fork pattern—remains consistent across all simulations.
A fork structure is defined as a boundary snarl with exactly two paths, which can themselves contain nested forks (see fork representation below).
Once the pangenome graph was constructed, we simulated individual haplotypes by generating sample genomes that traverse different paths within the graph. To introduce phenotype-genotype relationships, we assigned each haplotype a binary or quantitative phenotype according to specific simulation rules.
The eQTL simulation differs in that it includes 10 chromosomes, but no pangenome graph is constructed. Instead, we directly simulate the path-snarl file required by STOAT, and the variations consist only of simple SNPs.
| Type | Number of samples | Number of variants by type (SNP/INDEL/COMPLEX) | Number of snarls/paths |
|---|---|---|---|
| Binary | 200 | 2444/446/158 | 1524/3048 |
| Quantitative | 200 | 2478/382/138 | 1499/2998 |
| eQTL | 200 | 200000/0/0 | 100000/200000 |
In the binary simulation, the 200 samples were divided into two cohorts (100 cases and 100 controls), each corresponding to a different phenotypic state containing 1,000 variations. Each group had its own probability of traversing a specific path within each snarl. This probability could be equal between the two cohorts (e.g., 50/50) or skewed in favor of one group (e.g., 20/80), simulating an association between variation and phenotype.
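The per-group path probabilities can be pictured with a small sketch. It assumes one haplotype per sample, a single fork with two paths, and that a pair such as 20/80 means the probability of taking the first path in each of the two cohorts. The function name and group labels are hypothetical and do not come from the simulation code.

```python
import random


def simulate_binary_snarl(n_per_group, p_path_a, rng):
    """Return, for each group, the counts [path A, path B] for one fork."""
    counts = {}
    for group, p in p_path_a.items():
        # Each sample independently takes path A with its group's probability.
        took_a = sum(rng.random() < p for _ in range(n_per_group))
        counts[group] = [took_a, n_per_group - took_a]
    return counts


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Equal probabilities: no association is simulated at this snarl.
null_snarl = simulate_binary_snarl(100, {"control": 0.5, "case": 0.5}, rng)

# Skewed probabilities: path usage depends on the cohort.
assoc_snarl = simulate_binary_snarl(100, {"control": 0.2, "case": 0.8}, rng)

print(null_snarl)
print(assoc_snarl)
```

The resulting counts form the kind of cohort-by-path table on which an association test is run.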
In the quantitative simulation, as in the binary one, 200 samples were divided into two cohorts, with each group having its own probability of traversing specific paths within snarls. In this case, an additional probability factor was introduced so that the likelihood of passing through a given path depends on the individual’s phenotype value.
In the eQTL simulation, we generated 200 samples with 200,000 SNPs and 100 genes. All variants were generated randomly, so no true associations were introduced. The goal was to create a simple simulation to test STOAT’s VCF-based eQTL pipeline.
Covariates were simulated based on the phenotype, except for sex. They include SEX, PC1, PC2, and PC3. The SEX covariate is randomly assigned as male or female, while the three PCs are simulated from a normal distribution.
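A minimal sketch of such a covariate table follows. How the PCs depend on the phenotype is not specified here, so the sketch assumes that the phenotype shifts the mean of each normal draw. The `effect` parameter, the function name, and the unit standard deviation are all hypothetical.

```python
import random


def simulate_covariates(phenotypes, rng, effect=0.5):
    """Build one covariate row (SEX, PC1, PC2, PC3) per sample."""
    rows = []
    for pheno in phenotypes:
        # SEX is drawn independently of the phenotype.
        row = {"SEX": rng.choice(["male", "female"])}
        for pc in ("PC1", "PC2", "PC3"):
            # Assumption: the phenotype shifts the mean of the normal draw.
            row[pc] = rng.gauss(effect * pheno, 1.0)
        rows.append(row)
    return rows


rng = random.Random(42)
phenotypes = [0] * 100 + [1] * 100  # binary phenotype: 100 controls, 100 cases
covariates = simulate_covariates(phenotypes, rng)
print(covariates[0])
```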
The binary and quantitative simulations produce a file containing the following elements:
If the two groups have identical probabilities on the same edge, an association at that edge may appear significant by chance, but the edge is treated as non-significant in downstream analysis. Conversely, any difference in frequency between the two groups, even a small one, is treated as significant.
Example freq file:
start_node next_node group freq
2 3 0 0.53
2 3 1 0.53
2 4 0 0.47
2 4 1 0.47
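The labelling rule described above can be sketched as follows, using the example file as input. The sketch assumes whitespace-separated columns with a single header row. The function name `load_truth` is made up for this illustration.

```python
from collections import defaultdict

FREQ_TEXT = """start_node next_node group freq
2 3 0 0.53
2 3 1 0.53
2 4 0 0.47
2 4 1 0.47
"""


def load_truth(freq_text):
    """Map each (start_node, next_node) edge to True if the groups differ."""
    by_edge = defaultdict(dict)
    lines = freq_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        start, nxt, group, freq = line.split()
        by_edge[(int(start), int(nxt))][int(group)] = float(freq)
    # An edge is a simulated association if its group frequencies are not all equal.
    return {edge: len(set(freqs.values())) > 1 for edge, freqs in by_edge.items()}


truth = load_truth(FREQ_TEXT)
print(truth)
```

In the example file both groups share the same frequency on each edge, so neither edge is labelled as a simulated association.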
To assess whether a snarl is correctly identified as significant, we
merge the snarl’s path information with the frequency probabilities from
the freq file.
A correct match is a snarl that contains both start/end node pairs in two separate paths (one for each group).
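One possible reading of this matching rule is sketched below. It assumes that each snarl path is available as a list of node IDs and that each simulated fork is described by its two edges from the freq file. The snarl identifier `1_5`, the function names, and the toy data are hypothetical, and the percentage computed here is only one plausible way to define the share of simulated forks that were tested.

```python
def path_edges(path):
    """Set of consecutive (node, next_node) pairs along one path."""
    return set(zip(path, path[1:]))


def snarl_matches_fork(snarl_paths, edge_a, edge_b):
    """True if edge_a and edge_b are carried by two different paths of the snarl."""
    with_a = {i for i, p in enumerate(snarl_paths) if edge_a in path_edges(p)}
    with_b = {i for i, p in enumerate(snarl_paths) if edge_b in path_edges(p)}
    return any(i != j for i in with_a for j in with_b)


def percent_forks_tested(snarls, forks):
    """Percentage of simulated forks matched by at least one tested snarl."""
    matched = sum(
        any(snarl_matches_fork(paths, a, b) for paths in snarls.values())
        for a, b in forks
    )
    return 100.0 * matched / len(forks)


# Toy data: one tested snarl with two paths, and two simulated forks.
snarls = {"1_5": [[1, 2, 3, 5], [1, 2, 4, 5]]}
forks = [((2, 3), (2, 4)), ((7, 8), (7, 9))]
print(percent_forks_tested(snarls, forks))
```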
##
## Percentage of paths (present in freq file) tested : 95.99737532808399
##
## Percentage of paths (present in freq file) tested : 89.56692913385827
##
## Percentage of paths (present in freq file) tested : 98.9501312335958
## Column Num_Different_Rows
## 1 CHR 0
## 2 START_POS 0
## 3 END_POS 0
## 4 PATH_LENGTHS 0
## 5 P_FISHER 1295
## 6 P_CHI2 1439
## 7 P_ADJUSTED 1449
## 8 GROUP_PATHS 1451
## 9 DEPTH 0
## 10 POS 0
This change can be explained by the fact that Stoat Graph treats the ‘ref’ haplotype as a correct haplotype, whereas in the simulation it is not.
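The table above appears to count, for each column, the rows whose values differ between two result tables. A comparison of that kind can be sketched as follows, assuming both tables have the same rows in the same order and are loaded as lists of dictionaries (for example with `csv.DictReader`). The function name and the toy rows are made up for this illustration.

```python
def count_different_rows(rows_a, rows_b, columns):
    """For each column, count the rows whose values differ between two tables."""
    if len(rows_a) != len(rows_b):
        raise ValueError("tables must have the same number of rows, in the same order")
    return {
        col: sum(a[col] != b[col] for a, b in zip(rows_a, rows_b))
        for col in columns
    }


# Toy tables: same coordinates, one p-value differs.
table_a = [
    {"CHR": "1", "POS": "100", "P_FISHER": "0.01"},
    {"CHR": "1", "POS": "250", "P_FISHER": "0.50"},
]
table_b = [
    {"CHR": "1", "POS": "100", "P_FISHER": "0.02"},
    {"CHR": "1", "POS": "250", "P_FISHER": "0.50"},
]
diff = count_different_rows(table_a, table_b, ["CHR", "POS", "P_FISHER"])
print(diff)
```

Values are compared exactly as loaded, so p-values that differ only by rounding would also be counted as different.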
##
## Percentage of paths (present in freq file) tested : 90.86057371581055
##
## Percentage of paths (present in freq file) tested : 90.86057371581055